Yale Researchers Prove That ACID Is Scalable
An anonymous reader writes "There has been a lot of buzz in the industry lately about NoSQL databases helping Twitter, Amazon, and Digg scale their transactional workloads. But there has been some recent pushback from database luminaries such as Michael Stonebraker. Now, a couple of researchers at Yale University claim that NoSQL is no longer necessary now that they have scaled traditional ACID-compliant database systems."
Pfah. (Score:5, Interesting)
NoSQL never was necessary. Traditional SQL databases - not just terascale, but even simple ones like MySQL - regularly deal with data volumes at Google and Walmart that make the sites that built these databases in desperation look positively tiny.
Digg's engineers wear clown shoes to work.
Re:Pfah. (Score:4, Insightful)
It was never database size that was the problem, but the number of queries per second (a.k.a. performance) that could be executed.
You can run a Google-size database on MySQL, but you can't use MySQL* to implement a search solution with performance like Google's without requiring much, much more hardware.
*Or any other SQL database.
Re:Pfah. (Score:4, Insightful)
Well, and if you don't need it [the guarantees of ACID], why pay for it? I mean, if you have to spend any amount of time thinking about "How do I make that work?" that's a cost.
Whereas if all you care about is updating individual records without global consistency, well, don't enforce global consistency.
Re:Pfah. (Score:5, Insightful)
NoSQL is not really about scalability, it is about modelling your data the same way your application does.
There is a strong disconnect between the way SQL represents data and the way traditional programming languages do. While we've come up with some clever solutions like ORM to alleviate the problem, why not just store the data directly without any mapping?
I am not suggesting that SQL is never the right tool for the job, but it most certainly is not the right tool for every job. It is good to have many different kinds of hammers, and perhaps even a screwdriver or two.
Re:Pfah. (Score:5, Insightful)
There is a strong disconnect between the way SQL represents data and the way traditional programming languages do.
Yes but there is a strong disconnect between computer RAM and information. Computer RAM contains DATA; information comes in associated tables. Relational databases represent data in tables with indexes, keys, etc. A Person is unique (has a unique ID), but they may share First Name, Last Name, and even Address (junior/senior in same household). There are many Races, and a Person will be of a given Race (or mix, but this is horribly difficult to index anyway). A Person will own a specific Car; that Car, in turn, will be a particular Make-Model-Year-Trim, which itself is a hierarchy of tables (Trim and Year are pretty separate, Model however will be of a particular Make, while a particular car available is going to be Model-Year-Trim).
Indexing and relating data in this way turns it into information, which is what we want and need. Separating the data eliminates redundancies and lets us use fewer buffers along the way, crunching down smaller tables and making fast comparisons to small-size keys before we even reference big, complex tables. Meanwhile, we're still essentially asking questions like "Find me all people who own a 1996-2010 Year Toyota Prius." Someone might own 15 cars, so we're looking in the table of all individual Cars with MYT where table MYT.Model = (Toyota Prius) and .Year is between 1996 and 2010, and pulling all entries in table Persons for each unique Cars.Owner = Persons.ID (an inner join).
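The Prius question above can be sketched as a pair of inner joins. This is only an illustration using Python's built-in sqlite3 module; the table and column names (persons, myt, cars, myt_id, trim_level) and the sample rows are assumptions made up for the example, not a schema from the comment.

```python
import sqlite3

# In-memory database with a hypothetical Persons / MYT / Cars layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE myt (
        id INTEGER PRIMARY KEY,
        make TEXT, model TEXT, year INTEGER, trim_level TEXT
    );
    CREATE TABLE cars (
        vin TEXT PRIMARY KEY,
        owner INTEGER REFERENCES persons(id),
        myt_id INTEGER REFERENCES myt(id)
    );
    INSERT INTO persons VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO myt VALUES
        (10, 'Toyota', 'Prius', 2004, 'Base'),
        (11, 'Toyota', 'Prius', 2012, 'Two'),
        (12, 'Honda', 'Civic', 2001, 'LX');
    INSERT INTO cars VALUES
        ('VIN1', 1, 10), ('VIN2', 1, 12), ('VIN3', 2, 11), ('VIN4', 3, 12);
""")

# Filter the small MYT table first, then join through Cars to Persons.
# DISTINCT collapses people who own more than one matching car.
prius_owners = conn.execute("""
    SELECT DISTINCT persons.id, persons.name
    FROM cars
    JOIN myt ON cars.myt_id = myt.id
    JOIN persons ON cars.owner = persons.id
    WHERE myt.make = 'Toyota'
      AND myt.model = 'Prius'
      AND myt.year BETWEEN 1996 AND 2010
    ORDER BY persons.id
""").fetchall()

print(prius_owners)  # [(1, 'Alice')]
```

Bob also owns a Prius in this sample data, but it is a 2012 model, so the year range excludes him.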
Information theory versus programming. We're studying information here. We might have something more interesting to do than look in a giant array of Cars[VIN] = &Owners[Index]. For the actual data, the model we use makes sense; programmers get an API that says "Yeah, ask me a specific structured question and I'll give you a two-dimensional array to work with as an answer." That two-dimensional array is suitable for programming logic to manipulate specific structured data; extracting that data from the huge store of structured information is complex, but handled by a front-end that has its own language. You tell that front-end to find this data based on these parameters and string it together; it does tons of programming shit to search, sort, select, copy, and structure the data for you.
Re:Pfah. (Score:5, Insightful)
Re:Pfah. (Score:2)
This is why I call myself a database programmer. I'm not a DBA, never have been and don't want to be. I understand how to make the database do what it needs to do. At a high level, I understand how data is stored to disk, but I don't really care about that (that's a DBA's job). I also understand at a high level the questions that an application developer needs to ask (not a DBA's job at all). I bridge the gap and write code (sprocs, triggers, functions, etc.) to support the app. I tune queries and db code to support the database server (and the user experience). I have yet to meet an English question about data that can't be answered in performant SQL.
I'm not opposed to NoSQL (even with a nick like SQLGuru), but like was said earlier, a sledge hammer, a tack hammer, and a claw hammer all have their appropriate uses......some even drive screws in better than others. SQL is still the right tool for business applications (where aggregating and reporting is extremely important). Right tool + right job = cake (it's not a lie). Wrong tool + right job = frustration.
Re:Pfah. (Score:3, Insightful)
Unless you're writing the code for the database engine, you are NOT a database programmer, you're an application programmer...
Book: SQL Antipatterns (Score:3, Informative)
Re:Pfah. (Score:5, Informative)
You have no idea what you're talking about, probably because your brain has been irreversibly warped by MySQL. Concurrent writing is widely-supported.
Hint: MVCC [wikipedia.org].
Re:Pfah. (Score:5, Informative)
An ACID compliant RDBMS can't even get read access to the user, car, friend, picture and pet_survey_answer table set as long as any of the million users of the system is making a change to his data, even if the application only locks one table at a time for write access, let alone the problem of a million users trying to gain write access to the same table at the same time.
Wow. Just wow. Any serious ACID-compliant RDBMS can do that with no problem.
I thought it was an array of structs (Score:2)
"Yeah, ask me a specific structured question and I'll give you a two-dimensional array to work with as an answer."
I thought it was more like an array of structs, where each array entry is a row and each struct member is a column. In non-C you might say each row is an object, each field-of-a-class is a column (where class : table) and each field-of-an-object is a single cell.
Then the cartesian product operation on tables of types T1 and T2 (respectively) has a type which is the product of T1 and T2, and everything matches up neatly.
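The "array of structs" reading can be sketched in Python, with a NamedTuple standing in for the struct. The Person and Car types and their fields are hypothetical, chosen only to show that each element of the cartesian product is a pair typed (Person, Car), and that a join is just a filter over that product.

```python
from itertools import product
from typing import NamedTuple


class Person(NamedTuple):  # one row of a hypothetical persons table
    id: int
    name: str


class Car(NamedTuple):  # one row of a hypothetical cars table
    vin: str
    owner: int


persons = [Person(1, "Alice"), Person(2, "Bob")]
cars = [Car("VIN1", 1), Car("VIN2", 1), Car("VIN3", 2)]

# Cartesian product: every element is a (Person, Car) pair,
# i.e. a value of the product type of the two row types.
cross = list(product(persons, cars))

# An inner join is the product restricted by a predicate.
joined = [(p.name, c.vin) for p, c in cross if c.owner == p.id]

print(len(cross))  # 6
print(joined)      # [('Alice', 'VIN1'), ('Alice', 'VIN2'), ('Bob', 'VIN3')]
```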
Re:I thought it was an array of structs (Score:2)
You mean a linked list. I'm not sure for your particular API.
The issue here is that you get rows that are effectively struct { char[]; int; long int; int; double; char[]; char[5]; }; which you can do. What you can also do is void* result[][], where (*(result[row][column])) (note that the inner set of parentheses is optional in this case, but syntactically valid and more visually clear) points to the correct data.
Working with arbitrary data gathered from an arbitrary information set is a pain. Consider the task, though: Keeping a pile of information and performing arbitrary operations on arbitrary subsets of that information, including arbitrary entries (rows) and/or arbitrary attributes (columns).
Re:Pfah. (Score:3, Insightful)
Yeah, ask me a specific structured question and I'll give you a two-dimensional array to work with as an answer.
That's fine until someone asks you an unstructured question for which a two-dimensional array cannot contain the answer.
Like, for example, 'Here's an ordered DOM tree of nodes each containing tags, subtrees and/or chunks of CDATA'.
Or 'Here is a set of objects each of which contain their own custom properties not found in others.'
Not every form of useful information in the real world is strictly typeful and represents a well-formed relation over finite domains.
Re:Pfah. (Score:5, Informative)
A decade plus ago, and that would be true.
Standard SQL from SQL-99 on will, in fact, do this quite easily via recursive Common Table Expressions. Now, some SQL-based DBMSs don't support enough of the standard to use this, but current versions of, I believe, DB2, Firebird, PostgreSQL, and SQL Server all implement standard CTEs well enough to do those examples in SQL fairly directly, and Oracle has its own proprietary syntax (CONNECT BY) that works for the examples that you pose, though it's less general than SQL-99 recursive CTEs.
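As a rough illustration of the recursive CTE approach, here is a sketch that walks a DOM-like tree stored as an adjacency list. It uses SQLite through Python's sqlite3 module (SQLite also supports WITH RECURSIVE); the nodes table, its columns, and the sample rows are assumptions invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical DOM-like tree: each node points at its parent.
    CREATE TABLE nodes (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES nodes(id),
        tag TEXT
    );
    INSERT INTO nodes VALUES
        (1, NULL, 'html'), (2, 1, 'body'), (3, 2, 'div'),
        (4, 3, 'p'), (5, 1, 'head');
""")

# The anchor member selects the starting node; the recursive member
# repeatedly joins children onto the rows found so far, to any depth.
subtree = conn.execute("""
    WITH RECURSIVE subtree(id, tag, depth) AS (
        SELECT id, tag, 0 FROM nodes WHERE id = ?
        UNION ALL
        SELECT n.id, n.tag, s.depth + 1
        FROM nodes AS n
        JOIN subtree AS s ON n.parent_id = s.id
    )
    SELECT id, tag, depth FROM subtree ORDER BY depth, id
""", (2,)).fetchall()

print(subtree)  # [(2, 'body', 0), (3, 'div', 1), (4, 'p', 2)]
```

One query returns the whole subtree under body without a sub-select per level.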
Re:Pfah. (Score:5, Interesting)
Totally agree. Only problem is writing recursive CTE queries is beyond most programmers. Hell, a lot of programmers struggle with anything but simple inner joins.
IMHO CTEs are one of the most underused and powerful features of SQL. Not just for recursive queries, but for bridging the gap between functional and procedural programming.
I write all my complex queries as a series of simple CTEs now - each CTE gets me one step closer to the actual query I need, and the magic of the query optimizer combines them all into a single query plan. Makes testing, debugging and maintaining a complex query about a million times easier.
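A small sketch of the "series of simple CTEs" style, again using SQLite via Python's sqlite3 module. The orders table and the question asked of it (each customer's best month) are hypothetical; how a given engine plans chained CTEs varies, so this shows only how the query reads, not how it performs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, month TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('acme', '2010-06', 100), ('acme', '2010-06', 50),
        ('acme', '2010-07', 20),
        ('zenith', '2010-06', 30), ('zenith', '2010-07', 300);
""")

# Each CTE is one small, separately testable step:
#   monthly: total per customer per month
#   best:    highest monthly total per customer
# The final SELECT stitches the steps together.
rows = conn.execute("""
    WITH monthly AS (
        SELECT customer, month, SUM(amount) AS total
        FROM orders
        GROUP BY customer, month
    ),
    best AS (
        SELECT customer, MAX(total) AS best_total
        FROM monthly
        GROUP BY customer
    )
    SELECT m.customer, m.month, m.total
    FROM monthly AS m
    JOIN best AS b
      ON b.customer = m.customer AND b.best_total = m.total
    ORDER BY m.customer
""").fetchall()

print(rows)  # [('acme', '2010-06', 150), ('zenith', '2010-07', 300)]
```

To debug a step, replace the final SELECT with `SELECT * FROM monthly` and inspect the intermediate result.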
Re:Pfah. (Score:3, Informative)
SQL, as such, is declarative. Many RDBMSs include, in addition to SQL, an SQL-derived procedural scripting language (Oracle's PL/SQL, and so on.)
Re:Pfah. (Score:4, Interesting)
That is an excellent question for a DBA evaluation exercise.
So...
Efficient SQL Usage == Programmer + DBA
Efficient NoSQL Usage == Programmer
Thank you for making the case for NoSQL so clearly.
Re:Pfah. (Score:3, Insightful)
That depends. If I'm storing video data I don't want a relational database. A small-scale family tree might be good in a proprietary format. A large-scale family tree might also be good in a proprietary format. The Windows registry is inherently hierarchical and needs a non-relational model, just like file systems (quit arguing that file systems should be relational DBs; the current model is fine).
A large-scale family tree that I need to use to look up other information with absolute identity (i.e. there are 15 James Clyde Simmons in the world, 7 in my city somehow, and 3 in my zip code!) needs to at least sync its individual identifiers with the primary key of an RDBMS holding all the other data in any case where relational analysis is also needed, i.e. find me all PERSONS with $ATTRIBUTE. Keeping these two things in absolute sync requires a specialized database engine; but you can write program code that fakes it for all useful cases if you keep the primary common identifier unique and static.
There are going to be tasks where an RDBMS is excellent and anything else is going to be complete failure. College information systems, forever, have to track students vs student IDs vs all completed courses and grades vs when those courses were completed vs what courses the student is enrolled in now vs if they've paid for their tuition... this is the wrong kind of information to list line by line (flatfile) or hierarchically. Maybe I want to see everyone enrolled in MATH314, or everyone enrolled in MATH314 class DXA, or everyone enrolled in MATH314 on Middlesex campus. Maybe I want to see all courses James Peak is enrolled in, or has enrolled in ever. For these tasks, you need an RDBMS.
There are also going to be good flatfile cases-- MP3s, video files, XCF, etc. As well, there will be stores of information that must fall into hierarchical organization-- file systems, genealogy databases, the Windows registry. These should optimally not use an RDBMS structure.
There will be tasks that operate on one set of data but bring a corner case that benefits from another method of organization. For example, looking through a database at an insurance company to check for dependents (parents/children/spouses). Of course hierarchical databases might be better for this operation; but all the information and all operations you'll ever do is going to go better in an RDBMS, and any other storage method will require either tons of cross-indexing (to the point of implementing a BAD RDBMS) or lots of memory and time to do 0.06 second queries in 10 minutes. Too slow, too broken. The corner case operations cause trouble, but what can you do?
Re:Pfah. (Score:2)
Re:Pfah. (Score:2)
For this reason I suggest that app language designers work on better fitting RDBMS and SQL rather than the other way around (at least for data-driven apps). OOP may be nice, but it inherently conflicts with relational concepts and patterns. Generally, one is based around attribute-handling idioms and the other behavior-handling idioms. OOP also tends to be nested, hierarchical, and/or graph-shaped; while relational is set-centric. Either you de-emphasize one or the other, or deal with complicated and expensive translation layers. Barring some revolutionary breakthrough, something has to give. Right now it's like men wearing women's underwear and vice versa.
Re:Pfah. (Score:3, Funny)
Right now it's like men wearing women's underwear and vice versa.
You mean it makes me feel pretty?
Re:Pfah. (Score:3, Insightful)
NoSQL is not really about scalability, it is about modelling your data the same way your application does.
I 100% agree. Earlier this year I moved a prototype application built around SQLite and flat files to MongoDB. MongoDB is SQL-like in its ability to have queries and indexes; but it stores its data in a way that doesn't require me to deconstruct all of my data structures into tables. This dramatically reduced complexity in code that used to deal with 5-6 SQLite tables. In the case of MongoDB, I was able to replace 5-6 tables with a single collection of structured documents. MongoDB lets me write queries against data that's deeply nested, yet it can return the full data structure so I don't have the performance hit (and programmer time hit) of running (and writing) many queries to hydrate data structures around foreign key relationships.
The other advantage to MongoDB is that its schemaless approach makes it much easier to handle inheritance. I can have documents with common parts for base classes, and varying parts for child classes. This is much harder in SQL, because I either need to design a super-table that can handle all variations of the base class, or I need to use a multi-join around all potential classes that I can query. MongoDB's document-based approach, as opposed to SQL's table approach, lets me write a single query that can handle future subclassing of the data, and future variations of the data.
Whose data is it? (Score:4, Insightful)
but it stores its data in a way that doesn't require me to deconstruct all of my data structures into tables.
I take it this is not business-type data? Otherwise you're doing it backwards. Start with your Entity-Relationship diagrams, devolve into logical then physical data models, and THEN start programming.
I forget who said it but it's true: The data belongs to the business, not to the application. The data should be structured and stored in a way that it will still be readable years after your program has become obsolete. (Unless it's data that has a short "best before" date.)
--
.nosig
Re:Pfah. (Score:3, Informative)
So, remember, NoSQL means anything but SQL. It's not a standard; rather, it's an honest effort to try to experiment with different database techniques where traditional SQL just isn't meeting an industry need. Key-value databases aren't going to satisfy the "give me how many widgets we sold in June to evil inventors in the tri-state area" need; but they do satisfy the scalability need for sites that have millions of concurrent users.
Regarding Mongo, the NoSQL database that I use, it can answer the "give me how many widgets we sold in June to evil inventors in the tri-state area." Basically, instead of having 100 tables with foreign key relationships, you'll have 10 collections of "documents," which are really just data structures. You can query deeply into data structures and return partial data structures.
Let's assume I have an "invoices" collection. Each invoice has an array of "line items", and each item has a count. I can do the following in Mongo:
Again, NoSQL isn't a standard. It's basically experimenting with different ways of having a database with the hopes of finding one that's easier to work with. Mongo is a lot closer to SQL than things like key-value databases.
Re:Pfah. (Score:2, Insightful)
NoSQL has a lot to do with scalability. Sure, there are other reasons, but not enough to recommend them over hash databases. Hash databases, which do what you propose and a lot more, have been around for decades; their main con is the lack of scalability -- hence NoSQL. BerkeleyDB is an example, but the list is too huge to continue.
Re:Pfah. (Score:4, Interesting)
``There is a strong disconnect between the way SQL represents data and the way traditional programming languages do.''
I agree, but ...
``While we've come up with some clever solutions like ORM to alleviate the problem,''
I don't think ORM alleviates the problem so much as entrenches it. The classes-and-instances object model and the relational model are different, but can be expressed in one another. Object-relational mapping makes this easy by pretending the models are the same, and doing the mapping behind the scenes. This works for some cases, but if you want to get the best performance, you have to express things in a way that takes into account the efficiency considerations of the actual implementation. With ORM, you run into the situation where what is most succinct to express in code is not necessarily what is most efficient in terms of disk access and network resource usage. So, for efficiency reasons, you end up breaking the abstractions that your ORM provided ...
``why not just store the data directly without any mapping?''
There isn't really such a thing as "without any mapping". However, you can ensure that the constructs your API provides are equivalent to what you can efficiently fetch or store in your data store. Since typical RDBMSs are usually optimized to execute typical SQL queries efficiently, SQL is actually a fairly good starting point. You can optimize this by creating indices to speed up common operations, and by tuning your RDBMS to speed up common operations. And, no doubt, you can do even better by creating custom shortcuts for specific needs of your application.
This is sort of what so-called NoSQL databases do: they are optimized for specific scenarios, and thus may outperform stock RDBMSs that are optimized for "we don't know what you want to do, so we try to make everything reasonably fast". It's also worth noting that NoSQL systems often return stale data or even allow inconsistencies in order to improve performance. By contrast, the strength of a good relational database is preserving the integrity of your data no matter what happens. Different tools for different jobs - or at least, different optimizations for different scenarios.
Re:Pfah. (Score:5, Insightful)
NoSQL is not really about scalability, it is about modelling your data the same way your application does.
I've actually been in the business long enough to remember when relational databases were the new thing. What people seem to forget is that modeling your data in a different way than your application does *was the whole point*. The idea was to make data a reusable resource *across applications*. Of course, that turned out to be a lot harder than we thought it would be. Philosophically, one might well ask whether it is possible to understand data at all apart from its intended applications. Of course, by the time we'd figured that out, a whole new generation was coming up trying to create a Semantic Web.
I basically agree that SQL isn't always the right tool for the job. I happen to think certain aspects of the relational model are somewhat broken (e.g. composite keys), and SQL is a pretty crappy query language in any case. But I think because RDBMSs are a mature technology, recently trained programmers don't bother to understand them, and cover that lack of understanding by pooh-pooh-ing the stuff that's over their head. I went through a patch a few years ago where I was interviewing programming candidates who had XML coming out of their ears but hadn't the foggiest idea of what "NULL" means in the relational model. Naturally they had all kinds of problems on the relational end of things, and tended to view the RDBMS as a kind of pitfall in which bad things inexplicably happen. Consequently, they tended to think of the database as simply a backing store for the application *they* were working on. In some cases this is acceptable, but one often sees abominable schema that are the product of ignorance, pure and simple.
Naturally, non-relational systems are most attractive where performance is at a higher premium than flexibility. This characterizes many web applications that do a small number of relatively simple things, but to do it on a scale that takes special expertise to achieve using a relational model. That was very much the case at the beginning of the relational era, when applications tended to be narrower in scope and query optimization primitive. You thought of order line items as "part-of" an order, whereas in relational thinking they could just as easily be considered attributes of products. This made the programmer's job a lot easier, so long as the RDBMS could process invoices fast enough to make the users happy.
Re:Pfah. (Score:5, Interesting)
NoSQL never was necessary. Traditional SQL databases - not just terascale, but even simple ones like MySQL - regularly deal with data volumes at Google
Google uses BigTable, a NoSQL database.
Re:Pfah. (Score:3, Insightful)
Google initially used MySQL for Adwords, tried to switch away from it, and then switched back (if I recall correctly). Your Googling May Vary.
Re:Pfah. (Score:5, Funny)
"Your Googling May Vary."
Yes, that is exactly the problem with NoSQL.
Re:Pfah. (Score:4, Funny)
Funny. Insightful. Informative. So many options with your post. I'm sure at least one moderator will get it figured out.
Re:Pfah. (Score:5, Funny)
What? Oracle too slow? How dare you besmirch the all-powerful Larry Ellison. We switched from a mainframe environment which handled all our sales data to an Oracle-based ERP system. I'll show you how fast this puppy now runs. Let me show you our sales data for the last month...
Hang on, I'll get the answer in a minute...
Bear with me, it will be here soon...
Here's a bottle of Mountain Dew while you wait...
Can I get you anything to snack on? M&M's? Doritos? A Snickers bar perhaps?
Re:Pfah. (Score:2)
Re:Pfah. (Score:2)
Depends for what part, but Walmart's site runs at least partly on a "NoSQL" (I use the term loosely in this case) system.
Re:Pfah. (Score:5, Insightful)
Database size was never the main driving force behind the new move toward NoSQL databases. Support for distributed architectures is. In part, this is about handling lots of queries rather than handling lots of data; it also -- particularly if you are Google -- deals with latency when the consumers of data are widely distributed geographically.
And note that one of the companies that is heavily involved in building, using, and supplying non-SQL distributed databases is Google, who, as you so well point out, is very much aware of both the capabilities and limits of scaling with current relational DBs.
This new research may offer new prospects for better databases in the future -- but TFA indicates that the new design has a limitation which seems common in distributed, strongly-consistent systems: "It turns out that the deterministic scheme performs horribly in disk-based environments".
In fact, given that it proposes strong consistency, distribution, and relies on in-memory operation for performance, it sounds a lot like existing distributed, strongly-consistent systems based around the Paxos algorithm, like Scalaris. And it seems likely to face the same criticism from those who think that durability requires disk-based persistence, and that replacing storage on disks (which, one should keep in mind, can also fail) with storage in-memory simultaneously on a sufficient number of servers (which, yes, could all simultaneously fail, but durability is never absolute, it's at best a matter of the degree to which data is protected against probable simultaneous combinations of failures.)
So -- reading only the blog post that is TFA announcing the paper and not the paper itself yet -- I don't get the impression that this is necessarily a giant leap forward, though more work on distributed, strongly-consistent databases is certainly a good thing.
Re:Pfah. (Score:2, Funny)
After all, MySql is why slashdot is so relia~ `} v* m& + ' ,
Re:Pfah. (Score:2)
NoSQL never was necessary. Traditional SQL databases - not just terascale, but even simple ones like MySQL - regularly deal with data volumes at Google and Walmart that make the sites that built these databases in desperation look positively tiny.
It isn't data volume that is the problem. It is often data organization. Traditional SQL databases are row stores. For some applications that is not a good way to store data. Column stores make more sense in data warehousing, for example. Michael Stonebraker has blogged about this a few times at the same blog site cited by the submitter.
Re:Pfah. (Score:5, Informative)
digg does not need to worry anymore (Score:5, Funny)
Re:digg does not need to worry anymore (Score:3, Insightful)
offtopic:
Considering how fanatical digg users can be, I can't possibly imagine why they thought it was a good idea to implement the changes they've made.
Re:digg does not need to worry anymore (Score:3, Interesting)
Because the entire site had been completely overwhelmed by spammers? Digg went from a great site to go see what's new to a glorified RSS feed for cracked.com, college humor and reddit. They had to change something.
Re:digg does not need to worry anymore (Score:5, Funny)
Re:digg does not need to worry anymore (Score:5, Insightful)
Re:digg does not need to worry anymore (Score:3, Funny)
And they would have gotten away with it if it wasn't for those meddling kids!
Berkeley DB (Score:4, Funny)
Didn't Berkeley prove back in the 60s and 70s that acid was scalable?
Re:Berkeley DB (Score:2)
At the very least, they proved it was salable...
Interesting thesis (Score:5, Interesting)
In essence, TFA claims that if the traditional ACID guarantee "if three transactions (let's call them A, B and C) are active ... the resulting database state will be the same as if it had run them one-by-one. No promises are made, however, about which particular order execution it will be equivalent to: A-B-C, B-A-C, A-C-B" is not abandoned (as in NoSQL systems), but is even strengthened to a guarantee that the result will always be as if they arrived in A-B-C order, then it solves all kinds of possible replication problems, requires less networking between the many servers involved, and allows for high scaling while also keeping all the integrity constraints.
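A toy sketch of why a fixed order helps replication, assuming nothing about the actual system in TFA: transactions here are plain Python functions applied to a dict, and all names (apply_in_order, deposit, add_interest) are made up for the illustration. Ordinary serializability would accept either result below; agreeing on one order up front means every replica lands on the same state without comparing notes afterwards.

```python
def apply_in_order(initial_state, transactions):
    """Apply transactions one by one to a private copy of the state."""
    state = dict(initial_state)
    for txn in transactions:
        txn(state)
    return state


def deposit(account, amount):
    def txn(state):
        state[account] += amount
    return txn


def add_interest(account, percent):
    def txn(state):
        # Integer arithmetic keeps the example exact.
        state[account] += state[account] * percent // 100
    return txn


initial = {"alice": 100}
a = deposit("alice", 50)
b = add_interest("alice", 10)

# Two replicas fed the same agreed order reach the same state independently.
replica_1 = apply_in_order(initial, [a, b])
replica_2 = apply_in_order(initial, [a, b])

# A different serial order is still "serializable" but gives another state.
other_order = apply_in_order(initial, [b, a])

print(replica_1, replica_2, other_order)
# {'alice': 165} {'alice': 165} {'alice': 160}
```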
Re:Interesting thesis (Score:2)
Determinism solves many things in DB design; that's why things like WITH SCHEMABINDING for views and user defined functions in MS SQL make things run so much faster. With over 40 years of RDBMS design, it's odd that this path has never been gone down before. But the whole "it turns out that the deterministic scheme performs horribly in disk-based environments" point makes perfect sense if this is something that scales very well in high-memory environments that didn't exist until now.
Now THIS is news for nerds, it's too bad I had to scroll through so many LSD/Acid (hurr hurr drugs) jokes to get down to a comment of someone who actually read this.
Possible != Practical (Score:4, Insightful)
A bigger issue may be the cost of ACID even if it can in theory scale. Supporting ACID is not free. A free web service may be able to afford losing say 1 out of 10,000 web transactions. Banks cannot do it, but Google Experiments can. The extra expense of big-iron ACID may not make up for the relatively minor cost of losing an occasional transaction or customer. It's a business decision.
Re:Possible != Practical (Score:2)
Re:Possible != Practical (Score:3, Insightful)
Typically the NoSQL approach just shifts the problems from the database layer to the application programmer - if it's simply ignored, a typical app can't cope with unpredictable/corrupt data being returned from the db, and results in weird bug reports that cost a lot of development time to find and fix; and with these fixes parts of the ACID compliance are simply re-implemented in the app layer.
You gain some performance of the db, you lose some (hopefully less) performance in the app, and it costs you additional complexity and programmer-time in the app.
ACID does not imply SQL (Score:3, Insightful)
For instance, Neo4J is a scalable graph-based "nosql" DB with ACID.
NoSQL is also about arbitrary schemas (Score:2)
NoSQL's two big features are scalability and arbitrary schemas. While the paper covers the first (though I still think map/reduce has its place), NoSQL does do taxonomy-based (hierarchical) schemas better. The only way to do that in SQL is to have a property table, where the parent object is an object RID, and a huge table of attached properties and values to that. You might be able to get your indexes to perform reasonably well, but only by duplicating some data. And on top of that, just try writing a query for hierarchical data! You'll have sub-selects for each level of hierarchy. This means in order to do something relatively simple, like KPCOFGS of species classifications, you'll need a select and 6 sub-selects. At least that one is well defined. If it's not, you just don't know how many, and you have to write a recursive function to generate your select query, or process the results from it. Either way, you repeatedly consider 99% useless records at every level. True, you can cheat at this because there are always 7 levels. But that is not true for most other trees.
Re:NoSQL is also about arbitrary schemas (Score:2)
There is more than one way to do hierarchical queries; it just depends on the RDBMS. Oracle has had it for years and SQL Server implemented it in the 2005 edition. You don't need sub-selects.
Re:NoSQL is also about arbitrary schemas (Score:2)
Oracle's CONNECT BY is much much slower than a custom index based on nested sets...Tell me something about default SQL implementations...
Sure. Default SQL implementations are going to be more feature rich to accommodate for a larger set of use cases than a custom implementation which can make use of domain specific shortcuts for performance gains.
TMTOWTDI ... just sayin'
"premature optimization is the root of all evil"
-Knuth
Re:NoSQL is also about arbitrary schemas (Score:2)
This is true if you use the Adjacency List Model for hierarchical data. Nested Set Models are a better solution to storing hierarchical data and are extremely fast and efficient for selecting arbitrarily deep nested data without tons of sub-selects. Though inserts are slow in theory (because you have to re-balance the tree) there are practical ways of inserting data so performance doesn't suffer.
See the MySQL site for their discussion of the Nested Set Model [mysql.com], this article [developersdex.com] on the same topic by Joe Celko and a question [sqlmonster.com] about insert performance in which Celko responds.
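For readers who have not met the Nested Set Model, here is a minimal sketch using SQLite via Python's sqlite3 module. The taxa table, its lft/rgt columns, and the hand-assigned numbering are assumptions made up for the example; each node's (lft, rgt) interval encloses the intervals of all its descendants, so both queries are a single range comparison regardless of depth.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- lft/rgt come from a depth-first walk of the tree:
    -- Animalia(1,10) > Chordata(2,7) > Mammalia(3,4), Aves(5,6)
    -- Animalia(1,10) > Arthropoda(8,9)
    CREATE TABLE taxa (name TEXT PRIMARY KEY, lft INTEGER, rgt INTEGER);
    INSERT INTO taxa VALUES
        ('Animalia', 1, 10),
        ('Chordata', 2, 7),
        ('Mammalia', 3, 4),
        ('Aves', 5, 6),
        ('Arthropoda', 8, 9);
""")

# All descendants, at any depth: rows whose interval lies inside the parent's.
descendants = [row[0] for row in conn.execute("""
    SELECT child.name
    FROM taxa AS parent
    JOIN taxa AS child
      ON child.lft > parent.lft AND child.rgt < parent.rgt
    WHERE parent.name = ?
    ORDER BY child.lft
""", ("Chordata",))]

# All ancestors: rows whose interval encloses the node's.
ancestors = [row[0] for row in conn.execute("""
    SELECT parent.name
    FROM taxa AS child
    JOIN taxa AS parent
      ON parent.lft < child.lft AND parent.rgt > child.rgt
    WHERE child.name = ?
    ORDER BY parent.lft
""", ("Aves",))]

print(descendants)  # ['Mammalia', 'Aves']
print(ancestors)    # ['Animalia', 'Chordata']
```

The trade-off mentioned above is visible in the numbering: inserting a new node means shifting lft/rgt values for every row to its right.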
They answered the wrong question (Score:2, Insightful)
We knew ACID can scale already.
With enough money poured into it, and new implementations, ACID can scale.
They solved some problems with scaling out, not necessarily the problems with it scaling up. Scaling does not necessarily just mean replicas and quick failover -- it means good performance without millions spent on hardware too, in terms of overhead, storage requirements, storage performance, server performance.
NoSQL scales in certain cases less expensively, with less work, and doesn't require complicated DBM algorithms. The representation of data is also simpler, and requires less work to maintain than tables.
It's just a result of major existing SQL implementations being so expensive with large datasets, that sometimes it costs more in terms of performance and required hardware, than simply using NoSQL.
I also love this gem from the article:
If the system is also stripped of the right to arbitrarily abort transactions (system aborts typically occur for reasons such as node failure and deadlock), then problem (b) is also eliminated. ... given an initial database state and a sequence of transaction requests, there exists only one valid final state. In other words, determinism.
I suppose the authors are from a land where hard drive space is infinite, database server resources are always guaranteed ahead of time... I/Os never have unrecoverable errors, syscalls never return error codes, RAM is infinite, programs never crash.
The conclusion that ACID alone is the bottleneck is not necessarily true. The SQL language itself requires a complex implementation just to parse and implement queries, that can add latency.
Not Proven (Yet) (Score:2)
Just in case anybody else doesn't know... (Score:4, Informative)
From the Wikipedia Article (http://en.wikipedia.org/wiki/ACID [wikipedia.org])
"In computer science, ACID (atomicity, consistency, isolation, durability) is a set of properties that guarantee database transactions are processed reliably. In the context of databases, a single logical operation on the data is called a transaction."
NoSQL is about a lot of things. (Score:2, Interesting)
SQL syntax is dated and very obtuse. Just look at the different syntax between insert and an update. ...wouldn't you rather just have "save"?
Object-relational mapping is cumbersome and mis-matched in SQL. 1:many either yields n+1 queries or a monster cartesian product set. And, what about inheritance? It just doesn't jibe.
It isn't about losing ACID- although not every purpose needs ACID. Your average shared drive filesystem isn't ACID, for example.
When you have anemic domains that aren't nailed down and need to be readily flexible without big re-designs, JSON-based No-SQL works very well.
When you want to avoid n+1 and have well-defined data needs with 4MB of data across your object graph, No-SQL works... very very well.
When you want to segregate the business services and its backing data store from the separate concern of BI, No-SQL keeps the riff-raff out of your data store.
It's different. It solves different problems. Keep your mind open.
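To make the n+1 complaint above concrete, here is a small sketch against an in-memory SQLite database. The author and post tables and the run helper (which just counts statements) are hypothetical; the point is only to show the two shapes a 1:many load can take.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE post (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO post VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 2, 'b1');
    """
)

query_count = 0


def run(sql, params=()):
    """Execute a statement and count it, so the two strategies can be compared."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


def load_n_plus_one():
    """One query for the parents, then one more query per parent."""
    result = {}
    for author_id, name in run("SELECT id, name FROM author ORDER BY id"):
        posts = run(
            "SELECT title FROM post WHERE author_id = ? ORDER BY id", (author_id,)
        )
        result[name] = [title for (title,) in posts]
    return result


def load_with_join():
    """One query, but the parent columns repeat on every child row."""
    result = {}
    rows = run(
        """
        SELECT a.name, p.title
        FROM author AS a LEFT JOIN post AS p ON p.author_id = a.id
        ORDER BY a.id, p.id
        """
    )
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:
            result[name].append(title)
    return result


query_count = 0
n_plus_one_result = load_n_plus_one()
n_plus_one_queries = query_count

query_count = 0
join_result = load_with_join()
join_queries = query_count
```

With three authors the first approach issues four statements and the second issues one. With several 1:many relations joined at once, the single result set grows multiplicatively, which is presumably the "monster cartesian product" the comment has in mind.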
Not NoACID, NoSchema (Score:3, Interesting)
Interesting article (and yes, I read the article), but the point of the NoSQL movement isn't so much about SQL, or ACID, as much as it is about Schema.
Most applications today are written in object-oriented languages like Java, C#, Ruby, etc... and most common frameworks in these languages use object-relational models to essentially 'unpack' the object into a relational model, and then reconstitute the objects on demand. This post [tedneward.com] explains the kinds of problems better than most.
NoSchema is about storing data closer to the format we process it in today. Key-Value pairs. XML. Sets and Lists. Object-Oriented data structures. This is about abstractions that make developers more productive. It is a tool in a toolbox, and useful in some circumstance and not in others.
SQL databases do not have to be the 'one persistence data mechanism to rule them all'. We don't need one; we need many that solve differing classes of problems well.
Relaying my comments from the blog (Score:2)
To achieve 'nonconcurrency' one needs to introduce a global ordering of transactions. Which WILL require a shared resource among ALL of the transactions. No way around it, sorry.
And what's funny, this resource brings back some of the problems of ACID systems. However, there should be advantages (no need for rollbacks, etc.).
Besides, all of this doesn't tackle another advantage of NoSQL systems: working with HUGE amounts of data. There'll still be problems in ACID systems if data access requires communication between several storage nodes.
And don't forget the CAP theorem. You can't get Consistency, Availability and Partition Tolerance at the same time. RDBMSs typically 'solve' it by dropping the requirement for partition tolerance, usually by using quorum-sensing schemes, etc.
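A toy sketch of the global-ordering point, under heavy assumptions: the Sequencer class, the transfer helper, and replay are all invented here and are not the paper's design. It only illustrates that once a single shared component fixes the order, replaying the same log from the same initial state gives the same final state on every replica.

```python
from itertools import count


class Sequencer:
    """The shared resource: every transaction passes through it to get a slot."""

    def __init__(self):
        self._next = count()
        self.log = []

    def submit(self, txn):
        seq = next(self._next)
        self.log.append((seq, txn))
        return seq


def transfer(src, dst, amount):
    """Build a transaction; insufficient funds is a deterministic no-op."""

    def txn(state):
        if state[src] >= amount:
            state[src] -= amount
            state[dst] += amount

    return txn


def replay(initial, log):
    """Apply the log in sequence order to a private copy of the state."""
    state = dict(initial)
    for _, txn in sorted(log, key=lambda entry: entry[0]):
        txn(state)
    return state


initial = {"a": 100, "b": 0, "c": 0}
sequencer = Sequencer()
sequencer.submit(transfer("a", "b", 70))
sequencer.submit(transfer("a", "c", 50))  # skipped: only 30 left in "a"
sequencer.submit(transfer("b", "c", 20))

replica_1 = replay(initial, sequencer.log)
replica_2 = replay(initial, sequencer.log)
```

Both replicas end up identical without coordinating during execution, but every submit call still funnels through the one Sequencer, which is the shared-resource objection raised above.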
Field calls (Score:2)
This seems to be a reinvention of field calls, with a slightly different purpose.
ACID: Scale bigger, get slower (Score:3, Interesting)
TFA hints at this but doesn't come out and say it: the larger you scale, the more you swamp yourself with atomicity protocol overhead. If your database is geographically distributed, then you have to decide if atomicity is more important than forgoing the very large bills for the associated network usage. I suspect that this may explain a lot about why Google, Amazon, etc., went with NoSQL solutions.
Finally (Score:2)
Finally. I've been telling Bob that for years, but nooo, he insists that we keep using blotter paper and sour patch kids.
Summary (Score:5, Informative)
Short Summary:
We make some claims about scaling ACID databases, but then don't support them.
Longer summary:
We don't like NoSQL and enjoy making baseless cracks about it such as it being a "lazy" approach.
In our paper we demonstrate that our unconventional version of an ACID database scales better than a traditional ACID database in a specific environment, while merely throwing away some robustness guarantees and changing how transaction ordering works.
No direct comparison to any NoSQL implementation is made.
So yea, I'm not holding my breath for companies to start migrating away from NoSQL.
RDBMS is a golden hammer (Score:3, Insightful)
The reason that NoSQL is necessary is that ACID is not the only thing that developers need to think about. RDBMS was an innovative solution to the limitations of mainframe hierarchical databases circa 1970. Since then it has been the only game in town (At least for most enterprise software. Some of us do other things occasionally.)
It turns out that there are reasons to do things other ways, and having other options allows you to consider trade-offs. For many applications eventually consistent data scales just fine. For some applications, both big and small, an enterprise RDBMS is overkill. Why not just persist objects to a document store? Or even the file system?
The research is interesting, although I agree that we already knew we could scale the ACID paradigm. The conclusion is ridiculous. NoSQL has nothing to do with ACID, and it brings a richness to the conversation that has been missing for far too long. Like the Perl folks say, TMTOWTDI.
Re:I hate SQL and Databases in General... (Score:5, Insightful)
Because it works.
"It's old" is a terrible reason to replace something. Go back to your previous arguments and you have a case. After all, a Core i7 is based on a 1960's view of a problem with an enormous number of band-aids applied in the intervening years, but you don't seem too concerned with replacing that.
Re:I hate SQL and Databases in General... (Score:2)
Irony much?
You might wanna read my 2nd sentence. I know, I know. That's really far into my post.
Re:I hate SQL and Databases in General... (Score:2)
Uh.... I never said that "being old" is a reason to replace something, as you would have known if you had actually read the sentence you quoted. Given this observation, what am I to say about the fact that the Core i7 is based on a 1960's view of a problem? Besides, the Core i7 ISN'T a 1960's based solution, but is based on a 1960's solution. There is an important difference between the two statements.
Everything we do in CS is based on work that goes back to 1939 and even earlier. However, in the case of the Core i7 (as an example) we CHANGE the approach to try and fix various problems we have with our performance.
Personally, I think going back to old ideas and realizing that we can now implement them better/faster/cleaner is a great way to approach many problems. That a solution is "old" isn't a problem, but it is a problem if a solution has known issues, and we just live with them.
Re:I hate SQL and Databases in General... (Score:2)
You said:
Why is it that we continue to use a technology based on a 1960's view of a problem
Your complaint: It's an old way of doing things.
My point: stick with everything else in your post, where you talk about efficiency and finding the language awkward. Your last sentence is summarized by "It's old, and we've thought of other things since then". That's not a useful argument.
When I explicitly referred to the rest of your post, that was kind of a clue that I read it.
As long as we pretend that CISC is new, for example.
Re:I hate SQL and Databases in General... (Score:5, Informative)
Spoken with proud ignorance.
Anyone who has properly scaled an application knows the database isn't the problem. If it was, it wouldn't take 12 application servers to bring the thing to its knees. That said, most of your gripes equate to:
I am not a DBA and therefore I do not understand DBA and therefore I must complain.
Further SQL has nothing to do with ACID. AT ALL!
Re:I hate SQL and Databases in General... (Score:2)
I will absolutely agree that a well designed database does not have performance issues. However, I work in a segment of the industry that works with Health and Human services, and the databases have issues that make any reasonable DBA sick.
Nonetheless, database throughput is always an issue. Our applications scale just fine for our needs (as you imply), but it remains that even if only one person is running one application against the database, the throughput is just "meh" at best. This is because every operation requires queries against the database to move significant amounts of data from many different tables. Could we build applications with better performance? Absolutely, and using traditional Relational Databases too, if the Schema was properly designed.
All of this begs the question. The real question is why we use a technology that is so sensitive to bad schema design? Why use a technology that has such a high baseline overhead? Why use a technology that is so tedious? Why use a technology that is so hard to test?
Absolutely the developer doesn't have to build applications that inherit all these problems from the database. I have designed applications that sit on databases, and have none of these faults. But unfortunately not all the applications I work on were designed to avoid these issues.
Now you ARE right that I am not a DBA. But if I have a fault, it isn't because I don't understand the DBA, but that I don't understand the database....
And yeah, in my rant I criticized SQL and ACID and relational databases in general as if they were all the same. They are not, and in fact need not have any overlap at all. Still, I'll stand by my rant as an expression of my annoyance with various aspects (these and others) of this particular approach to the persistence problem.
Re:I hate SQL and Databases in General... (Score:2)
c'mon we use web services and only a few people complain about the inefficiencies there, we use XML and only some people complain about sprawling XML documents you can get.
You need to go learn a bit about DBs. SQL is pretty easy, once you've grasped the list-based concepts behind it. Stick to the simple bits and you're 90% done. They're not as bad as you think - it's just your ignorance that's confusing you.
All technology suffers from the flaws you point out, all technology is fragile and easy to create total crap out of. (I know, I've worked with some 'professional' developers who make the most godawful mess, some of them even think they really are god's gift to coding).
DBs incidentally are one of those strange technologies where a 'clean, elegant and well designed' schema is a bad thing. If you over-normalise a DB performance will suffer, as will the code you have to write to use it. If you cobble everything into a few tables, it actually goes faster and is easier to code against. Strange, but true.
Re:I hate SQL and Databases in General... (Score:2)
Maybe it is the fact that RDBMS-based solutions are too fragile and are too often crap. We need to develop in representations that make sense to developers, and have the right sorts of compiler technologies and tools that build the proper run time representations for performance.
That you CAN build manageable/fast/testable/efficient applications is only the first step.
The second is wringing manageable/fast/testable/efficient applications out of mere mediocre developers.
Re:I hate SQL and Databases in General... (Score:3, Insightful)
All of this begs the question. The real question is why we use a technology that is so sensitive to bad schema design? Why use a technology that has such a high baseline overhead? Why use a technology that is so tedious? Why use a technology that is so hard to test?
Because fairly consistently, for the past forty years, every time someone says they've created something better than SQL and released it to the market, the market proves them woefully and completely wrong. As such, as much as people piss and moan about SQL, SQL has consistently proven to be an excellent, general purpose solution and amazingly poorly understood by the masses. And solutions such as MySQL have only made things worse. That's not to say there are not superior niche solutions, only that SQL is one of the few database technologies which has continued to survive for decades as a general purpose solution, and rightfully so.
It's like the world suddenly doing their own plumbing, framing, and mechanical work and then proudly exclaiming the state of architecture and the car industry stinks because the world is falling apart around them. In reality, that means we need far more qualified DBAs and far fewer people who can barely spell "SQL" designing and condemning the world around us.
It's literally been years since I've run into a qualified DBA, despite the fact "DBA" was part of their title. Turns out, being able to spell "DBA" is all too often enough to qualify one for such a position. And don't get me started on the all too common case of people who don't even know what a DBA does and yet are responsible for actually creating the schema/data model.
Re:I hate SQL and Databases in General... (Score:3, Interesting)
All of this begs the question. The real question is why we use a technology that is so sensitive to bad schema design? Why use a technology that has such a high baseline overhead? Why use a technology that is so tedious? Why use a technology that is so hard to test?
Those statements could be applied to any technology that's being used inappropriately. Why are our programs so sensitive to bad algorithm design?
Re:I hate SQL and Databases in General... (Score:2)
Normalization [simple-talk.com] is not just some plot by database programmers to annoy application programmers. (That is merely a satisfying side effect!)
Re:I hate SQL and Databases in General... (Score:2)
Actually, if you look at set theory and declarative languages, SQL is coming to more traditionally procedural environments. (MS's LINQ, for example.) It's an amazing language, good at what it's supposed to do. You could nearly complain the same about XML transforms as SQL. They just collect & format data. It's the programmers who make it complex.
Unavoidable bottlenecks in systems come from storage, searches and transforms. If you want to remove the DB from the equation, what layer of your system should be performing these things?
BTW: The math in set theory hasn't changed since the 1960's, it doesn't "get old" and need replacing. And you should learn to spell COBOL, your rants will appear more credible.
Re:I hate SQL and Databases in General... (Score:2)
It's worth noting that, in addition to the arguments from proponents of non-relational databases, SQL also gets criticism from proponents of actually doing set theory right (e.g., Date and Darwen).
Really, SQL and the databases using it are shaped as much by optimization of disk-based storage using popular computing architectures of the time at which it took shape as any mathematical model of data.
As computing architectures and performance attributes (not speed, but relative costs of different access patterns) of storage media change, underlying database implementations and the languages that best leverage them may change, even when you want to be generally guided by set theory.
You hate what you don't understand (Score:5, Insightful)
Re:I hate SQL and Databases in General... (Score:2)
The parent is not a troll, it is spot on. The problem is that the database backend and the language frontend are tied together. To invent a new query language you need to invent a database backend to go with it, and you can't try out a new query language on an existing database deployment. Similarly, any innovations in the database backend are hampered by the limited syntax of SQL. If you can't make a small extension to SQL to get it working, then you can forget about implementing it at all. This pretty much means game over for any database innovations.
Even Relational Algebra is infinitely easier to understand than the pseudo-English mess that is SQL. Much like even Haskell is easier to read than COBOL.
Re:I hate SQL and Databases in General... (Score:2)
Any decent framework abstracts out the SQL syntax for you in a nice manner (say, ARel in the Rails 3.0 framework is quite nice), but you gain a lot of compatibility by using SQL, allowing you to choose from engines from SQLite in a flat file to Oracle on a cluster.
Hear, hear. (Score:2)
Yes, I'd like to be able to work with RDBMS data in REAL languages, not in ugly SQL or even uglier DB internal languages.
DB tables can be represented with lists, on which composable pure (side-effect free) functions could operate. So JOINs can be expressed as list comprehensions. 'where' naturally is expressed as filters, etc. Care should be taken to maintain purity of functions used in queries, so they can be optimized efficiently.
LINQ in C# has beginnings of something similar.
PS: Am I describing Haskell, by any chance? :)
PPS: If your query requires complex and non-trivial optimizations by the RDBMS engine, then it's a bad query.
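A rough sketch of the tables-as-lists idea from this comment, in Python rather than Haskell. The employees and departments data and the inner_join and where helpers are invented for illustration; both helpers are pure, so they compose freely.

```python
# Hypothetical "tables" as plain lists of row dicts.
employees = [
    {"name": "ann", "dept_id": 1, "salary": 120},
    {"name": "bob", "dept_id": 2, "salary": 90},
    {"name": "cy", "dept_id": 1, "salary": 80},
]
departments = [
    {"id": 1, "name": "eng"},
    {"id": 2, "name": "ops"},
]


def inner_join(left, right, predicate):
    """JOIN as a list comprehension: every (l, r) pair that satisfies predicate."""
    return [(l, r) for l in left for r in right if predicate(l, r)]


def where(rows, predicate):
    """WHERE as a filter; returns a new list and leaves the input untouched."""
    return [row for row in rows if predicate(row)]


pairs = inner_join(employees, departments, lambda e, d: e["dept_id"] == d["id"])
eng_pairs = where(pairs, lambda pair: pair[1]["name"] == "eng")
eng_names = [e["name"] for e, _ in eng_pairs]
```

Note that this is a naive nested-loop join with no indexes and no planner. Getting it to run efficiently on large data is exactly the job a query optimizer does, which is why keeping the functions side-effect free matters if something is ever going to reorder them.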
Re:I hate SQL and Databases in General... (Score:2)
Which problem? Storing your data, retrieving your data, modifying your data while guaranteeing transactional integrity, analyzing your data in aggregate, providing ways to recover your data, providing ways to reset your data to a previous state?
I'm not saying a traditional relational database is the perfect solution to everything, but it's silly to think that every approach will address the same set of concerns.
Re:I hate SQL and Databases in General... (Score:3, Informative)
What alternative have you seen that handles the same workload more efficiently? Flat files? I've seen plenty of database-related performance issues, but it's almost never inherent in the database - it's the idiot that wrote the lousy table-scanning code that's reading a couple rows out of a table with millions that's the problem.
If only you could start something like a "transaction", which you could then "roll back" after finishing the test, leaving the database in its original state. And if you could somehow "back up" the database and "restore" it on a test server, or under a different name. That would be awesome.
Checking your create/change scripts into source control is no more difficult than checking your C source in prior to compiling it.
While I don't totally disagree on this point, calling SQL "fixed" is a bit like saying C# and Java are the same. I promise you any meaty SQL Server code will not run on Oracle without very significant changes that will have to be done by someone that will cost you a lot of money (and likewise with Oracle to SQL Server). The capabilities vary wildly by platform, and the syntax is only identical for the simplest of CRUD statements.
I have to give this one a LOLWUT. If you're using a big RDBMS, it's likely a multi-user system. If you've got multiple users and connections, you want ACID. This isn't like imposing sorting overhead on data structures, it's like imposing the basic memory protection, process isolation, and filesystem durability you find in any competent operating system. If you want to see what it's like without those protections, go use Mac OS 9 for a week or so, or an Access database used by a few dozen people over a network.
Re:I hate SQL and Databases in General... (Score:2)
SQL is still SQL. SQL is fixed in a syntax and written with naming conventions and styles that can best be described as neo-Cobal.
Has relational algebra changed (no, it's complete)? Why would the basics of SQL change then? Sounds like you just don't understand relational math and structured information basics.
Re:I hate SQL and Databases in General... (Score:2)
Because SQL isn't a particularly faithful implementation of relational algebra?
Re:I hate SQL and Databases in General... (Score:2)
Re:I hate SQL and Databases in General... (Score:2)
I don't know if you are using SQL and "relational database" as equivalent... it seems that way. Anyhow a long time ago there were many different database solutions and most of them weren't relational databases. Then relational databases became popular and anything else almost seemed to disappear. I didn't really get this enormous shift because there are lots of domains where a relational database is not the natural representation of the information being modelled. But for most applications that most people are interested in relational databases work well and SQL represents the ideas behind relational databases quite well. So SQL is still here relatively unchanged decades later because nothing better has come along - apparently it fills its niche quite well - well enough that it hasn't been dislodged.
As for "neo Cobol" I think it was either Wirth or Dijkstra that said that typing speed was not the limiting factor in programming.
Re:I hate SQL and Databases in General... (Score:2)
The reason these other database types went away is because the relational db + SQL handles ad-hoc queries very well. In many if not most db applications that is a killer application.
Re:I hate SQL and Databases in General... (Score:2)
because on every application I have ever worked on, the Database has always been the performance bottleneck.
That means you need to fire your DBA and hire one that actually knows how to structure tables for performance.
Testing of DB applications is always a problem, because the running of tests generally changes the database, rendering tests unrepeatable without resetting the database.
And how is that different from testing any sort of application that has a persistent state?
Configuring applications to use this database or that database also ends up being a problem for most applications.
Really? What sort of libraries are you using? Every framework and DB library I've used has had a priority towards making it very easy to connect to a database. Usually, if you're only connecting to a single database, all you need to do is write your connection string in the appropriate file, and you're set. The only time you need to change that is when you're deploying your application from development to test, and from test to production.
Why is it that we continue to use a technology based on a 1960's view of a problem when clearly there ARE other solutions and ways to approach said problem?
Why do we use quicksort when there are other approaches to sorting?
Re:I hate SQL and Databases in General... (Score:2)
Why is it that we continue to use a technology based on a 1960's view of a problem when clearly there ARE other solutions and ways to approach said problem?
- seriously. I have this same problem with the entire DNA thing - it's too damn old and hard to understand.
I say we switch to a new paradigm - NoDNA.
From now on we don't need all those silly As and Gs and Ts and Cs and the entire twin helical strand idea, it's too freaking old. We must move on with times, so that we can implement NoDNA-DNA2 paradigm. It's going to be faster and easier on the eyes, it's going to have more Zaz in it. Zing, Zork, Kapowza, Mazooma in the bank!
It's just what cool kids would use.
Re:I hate SQL and Databases in General... (Score:2)
Oh no you don't like the syntax. That's a great reason to turn away from a technology that has been implemented enough times and had enough research to bring it to where it is today.
If you can get over not liking the syntax, the SQL standard is pretty awesome, as are many (but not all) actual databases that use it. It's powerful enough to let you do some pretty complex queries, it's reasonably easily optimisable (and there is a lot of literature about that) provided you're not using a lousy database engine (like MySQL which can't even handle basic relational calculus planning in a sane way), it's pretty fast, and it offers some great guarantees. I have absolutely no idea what you mean by being difficult to test - either you know how to test or you don't. SQL doesn't get in the way there. You have a production data store and a test data store; you test changes together.
Stored procedures are not so widely used because they're not standard enough. However, they're not hard to use with source code management - you're making the wrong argument.
Your last gripe is fair, and if you are *really* sure you don't need ACID overhead and you have a reasonable alternative database, go for it. You're giving up on all the other research that's gone into the common platform, but that's a tradeoff that might be worth it for some purposes.
Re:I hate SQL and Databases in General... (Score:4, Informative)
And don't get me started on stored procedures and the difficulty of using source code management with stored procedures.
That's easily solvable:
Stored procedures don't have to be any more difficult to manage than any other code.
Re:I hate SQL and Databases in General... (Score:2)
Absolutely true. I rewrote an application that had a 70 table database to use a simple tree structured representation - it ran two orders of magnitude faster and the code was easier to understand because the data representation conformed well to the actual problem domain. Relational databases are great but they aren't always the appropriate answer.
But as an aside I don't think hyperbole is the enemy of critical thinking - it is just a tool (perhaps weapon) the proper employment of which requires immensely more skill than most people possess.
Re:I hate SQL and Databases in General... (Score:2)
hmm. or you could have put an index on the right columns... which generally are implemented as tree structures. I'm sure your code was perfectly understandable to all who came after you, thinking they were working with a DB :)
Re:I hate SQL and Databases in General... (Score:2)
Re:I have to admit (Score:3, Funny)
I have a different image of ACID on Windows than they do.
Is it the image of Bill Gates in an Easter bunny outfit trying to force Steve Ballmer into a large cast iron kettle filled with Skittles and baby mice? 'Cause that's the image I have of ACID on Windows...